@mapellidario
Member

@mapellidario mapellidario commented Aug 21, 2025

related to #12228

Status

not tested

  • vocms0262, apply the patch.
    • backward compatibility. code running in an agent without changes to the secrets or config.py [3]
    • new feature. if we set the proper jdl, then it seems ok [5]
    • new feature. implement proper jobwrapper checks, init scripts feature flags, protections, ...

Description

It is straightforward to enable the use of tokens in remote jobs: it is enough to add the following classad to the job's jdl

use_oauth_services = cms_wmagent

However, it is not trivial to decide when it is safe to enable this feature. For example, if the file /var/lib/condor/oauth_credentials/cmst1/cms_wmagent.use does not exist in the wmagent VM, then condor_submit fails.
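A minimal sketch of a pre-submission check for that file, assuming the credential directory layout quoted above (`<cred_dir>/<user>/<token_name>.use`). The helper name `token_file_present` and the "non-empty file" criterion are assumptions for illustration, not part of WMCore or condor.

```python
from pathlib import Path

# Default taken from the path quoted above; other deployments may differ.
DEFAULT_CRED_DIR = "/var/lib/condor/oauth_credentials"


def token_file_present(user, token_name, cred_dir=DEFAULT_CRED_DIR):
    """Return True if <cred_dir>/<user>/<token_name>.use exists and is not empty."""
    token_file = Path(cred_dir) / user / f"{token_name}.use"
    try:
        return token_file.is_file() and token_file.stat().st_size > 0
    except OSError:
        # Unreadable directory or similar problem: treat it as "no token".
        return False
```

A check like this only tells us that a file is there, not that the token inside it is valid.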

We could use the condor cli or python bindings to check if a token is present [1], but these interfaces are likely to change in the future, see [2]. We need a solution that is tested, stable and reliable for years to come.

After a quick chat with Alan, I propose to use a feature flag to enable this. This way we do not rely on any additional condor interface: we just trust that every WMAgent is equipped with a proper token. This simplicity comes at the expense of manual intervention from operators, who have to disable tokens in case of infrastructure malfunctions.

I decided not to use a true/false feature flag because I want the application to be ready in case of another token name change.
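To illustrate why a string-valued flag is enough, here is a hedged sketch of how the configured token name could be turned into the classad shown above. The function name `oauth_classads` and the example job description are hypothetical; the actual JobSubmitter code may be organized differently.

```python
def oauth_classads(token_name):
    """Map the feature-flag value to extra submit-description entries.

    An empty or missing name disables tokens; any other value is passed
    to condor as the oauth service name.
    """
    name = (token_name or "").strip()
    if not name:
        return {}
    return {"use_oauth_services": name}


# Hypothetical base job description, only to show how the entries are merged.
job_ad = {"executable": "submit_py3.sh"}
job_ad.update(oauth_classads("cms_wmagent"))
```

With this shape, a future token rename only requires changing the configured value, not the code.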

(edit) I had another chat with Alan. We agreed that:

  • we always enable the feature on all the schedds in the wmagent secrets.
  • we add a check based on /usr/sbin/condor_store_cred in the wmagent-docker-run.sh script. if the token is not valid, then the wmagent is not initialized
  • we do not check the token while the agent is running
  • if there is a problem with the token infrastructure and wmagent cannot submit new jobs, operators will change config.py to disable token support, then restart the JobSubmitter component
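The startup check agreed above could be sketched as follows, written in Python for readability even though wmagent-docker-run.sh is a shell script. It assumes the command line and the output format shown in [1]. The function names are hypothetical, and matching on the "stored and is valid" line is an assumption about output that may change between condor versions.

```python
import subprocess


def credential_is_valid(query_output, token_name):
    """Check the text printed by `condor_store_cred query-oauth`."""
    lines = [line.strip() for line in query_output.splitlines()]
    stored_and_valid = any(
        "credential was stored and is valid" in line.lower() for line in lines
    )
    # The "Credential info" section lists "<name>.use = <timestamp>" per token.
    has_use_entry = any(
        line.split("=")[0].strip() == f"{token_name}.use" for line in lines
    )
    return stored_and_valid and has_use_entry


def query_oauth(user, binary="/usr/sbin/condor_store_cred"):
    """Run the query from [1] and return its stdout ("" on any failure)."""
    try:
        proc = subprocess.run(
            [binary, "query-oauth", "-u", user],
            capture_output=True, text=True, timeout=60, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return proc.stdout if proc.returncode == 0 else ""
```

An init script would then refuse to initialize the agent when `credential_is_valid(query_oauth("cmst1@cms"), "cms_wmagent")` is false.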

how to operate the agent after this PR is merged

enable

enable the feature by adding the following to WMAgent.secrets, then init the agent

OAUTH_CMS_TOKEN_NAME=cms_wmagent

If you cannot initialize the agent from scratch, then change /data/dockerMount/srv/wmagent/current/config/config.py as follows and restart the agent

- config.JobSubmitter.authCMSTokenName = ""
+ config.JobSubmitter.authCMSTokenName = "cms_wmagent"

disable

We do not support starting an agent without a valid token

Change /data/dockerMount/srv/wmagent/current/config/config.py as follows and restart the agent

- config.JobSubmitter.authCMSTokenName = "cms_wmagent"
+ config.JobSubmitter.authCMSTokenName = ""

Is it backward compatible (if not, which system it affects?)

yes. if the WMAgent.secrets file is not touched, then the value from etc/WMAgentConfig.py is used, which disables the use of tokens

Related PRs

PR for CMSKubernetes: dmwm/CMSKubernetes#1642

External dependencies / deployment changes

nope :)


[1] condor cli

cmst1@vocms0262:tokens $ /usr/sbin/condor_store_cred query-oauth -u cmst1@cms
Account: cmst1@cms
CredType: oauth
A credential was stored and is valid.
Credential info:
cms_wmagent.top = 1755208672
cms_wmagent.use = 1755270850
fully_qualified_user = "cmst1@cms"

condor python bindings v24.0 with AP running 24.0

(WMAgent-2.4.2rc7) [cmst1@vocms0262:current]$ python3
Python 3.12.11 (main, Jun 10 2025, 23:56:19) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor
>>> col = htcondor.Collector()
>>> col.locate(htcondor.DaemonTypes.Credd)
[ Name = "vocms0262.cern.ch"; MyType = "CredD"; Machine = "vocms0262.cern.ch"; MyAddress = "<188.185.7.194:4080?addrs=[2001-1458-d00-62--100-329]-4080+188.185.7.194-4080&alias=vocms0262.cern.ch&noUDP&sock=credd_195986_9ddb>"; CondorVersion = "$CondorVersion: 24.0.9 2025-06-26 BuildID: UW_Python_Wheel_Build $"; CondorPlatform = "$CondorPlatform: X86_64-AlmaLinux_8.10 $" ]

[2] https://its.cern.ch/jira/browse/CMSSI-124?focusedId=7068943&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-7068943

After extensive tests, I noticed that this area of condor is still under active development, with features still being added and polished. For example, the new python bindings htcondor2 introduce new functions with respect to htcondor, the bindings for 24.0 do not match the documentation, and the htcondor2 bindings for 24.X are not compatible with 24.0.

[3]

used wf dmapelli_SC_ProdPsi_test_tokens_v2_250822_181225_919 on test11 and vocms0262

2025-08-22 18:08:53,606:140712058281024:INFO:JobSubmitterPoller:[tokens] remote jobs will not contain oauth tokens.
2025-08-22 18:08:53,607:140712058281024:INFO:JobSubmitterPoller:[tokens] enable them:
[tokens] - change config.JobSubmitter.authCMSTokenName in /data/dockerMount/srv/wmagent/current/config/config.py
[tokens] - restart the agent
[tokens] otherwise, if you can initialize the agent from scratch:
[tokens] - set OAUTH_CMS_TOKEN_NAME in WMAgent.secrets
[tokens] - initialize the new agent

wf completed ok

[4]

edit config.py

(WMAgent-2.4.2) [cmst1@vocms0262:current]$ cat /data/srv/wmagent/current/config/config.py | grep -i token
#config.JobSubmitter.useOauthToken = False
config.JobSubmitter.oauthCMSTokenName = "cms_wmagent"

restart JobSubmitter, job content seems ok

cmst1@vocms0262:tokens $ condor_q


-- Schedd: vocms0262.cern.ch : <188.185.7.194:4080?... @ 08/28/25 18:34:35
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI    SIZE CMD
 354.0   cmst1           8/28 17:52   0+00:40:35 R  600000 1954 submit_py3.sh
[...]
cmst1@vocms0262:tokens $ condor_tail -maxbytes 102400 354.0
[...]
======= WMAgent token verification at Thu Aug 28 15:54:03 GMT 2025 ========

Content under _CONDOR_CREDS: /srv/.condor_creds
total 4
-rw-------. 1 cmsplt01 zh 1706 Aug 28 17:54 cms_wmagent.use
[...]

@dmwm-bot

This comment was marked as outdated.

@mapellidario
Member Author

Additional logic needs to be implemented.

  • we need to keep allow/deny lists of cmssw versions that support token auth for stagein and stageout
    • enable token in a remote job only if all the cmssw versions support tokens
  • we need to keep allow/deny lists of sites that support token auth for stagein and stageout
    • enable token in a remote job only if all the sites where the job can run support tokens

We could add four configuration parameters in config.py with lists of regexes. It's not clear to me if we want to check this at job submission level or at WQE level.
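A rough sketch of how the four regex lists could be evaluated, independent of where the check ends up living. All function and parameter names here are hypothetical, and the rule "deny wins over allow" is an assumption that has not been agreed on.

```python
import re


def matches_any(patterns, value):
    """True if value fully matches at least one regex in patterns."""
    return any(re.fullmatch(pattern, value) for pattern in patterns)


def is_allowed(value, allow, deny):
    """Deny wins over allow; an empty allow list allows nothing."""
    return matches_any(allow, value) and not matches_any(deny, value)


def job_can_use_token(cmssw_versions, sites,
                      cmssw_allow, cmssw_deny, site_allow, site_deny):
    """Enable tokens only if every CMSSW version and every site qualifies.

    Note that all() is True for an empty sequence, so a job with no sites
    listed is not blocked by the site lists.
    """
    return (all(is_allowed(v, cmssw_allow, cmssw_deny) for v in cmssw_versions)
            and all(is_allowed(s, site_allow, site_deny) for s in sites))
```

The same helper would work at job submission level or at WQE level, since it only needs the list of CMSSW versions and the list of candidate sites.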

see discussion at https://its.cern.ch/jira/projects/CMSSI/issues/CMSSI-124?filter=allissues

@amaltaro
Contributor

amaltaro commented Oct 9, 2025

As discussed with Kenyi and Andrea, we agreed on the following steps (on top of Dario's comment above):

  • is the token infrastructure ready for us to proceed? (talk to Stephan L)
  • once the new htcondor version is out, we can try out the htcondor API for validating the token (where this validation is likely going to be minimal for this functionality - expiration? cms scope?)
  • we shall not support site/storage-based token job submission (talk to Stephan L)
  • for CMSSW releases, we need to understand what the actual requirements are. Depending on these requirements, it might be inefficient to make this check against a list of CMSSW releases that might be requested by a given job. (talk to Shahzad + Stephan L)
  • change JobSubmitter to check the token before each(?) cycle to decide whether to use token or x509
  • cross-check whether CERN and FNAL nodes are properly configured to use tokens (puppet@CERN and HyunWoo@FNAL) - (talk to Florian and HyunWoo)
  • once we no longer use x509 for job submission, we would have to stop reporting its validity to WMStats (from AgentStatusWatcher), as we will probably no longer keep Alan's proxy up-to-date in the agents.
    • This is a somewhat orthogonal development. The suggestion is to add an AgentStatusWatcher configuration hook to disable this check.
    • Andrea suggests modifying the original GH issue description.

@stlammel

There are two cases of (un)available token support to distinguish:

  • no token support on the WMAgent machine or at some site
  • no token capability or broken token support of the workflow

I expect EOS token commissioning to start in December and complete in March of next year, before data taking resumes. Checking at startup if a WMAgent machine is token enabled and setting config.JobSubmitter.authCMSTokenName based on this looks fine to me.

For workflows that can't handle tokens, a WMAgent without token config, with such workflows directed there, seems necessary to me. Basing this on the CMSSW/xrootd version of the workflow seems fine.

Dropping x509 support is on a different, longer timescale than the current step of providing the job with a token in addition to the x509, and is probably over a year away.

Thanks,
- Stephan
